Main Analysis
Provide a detailed, well-organized description of your findings, including textual description, graphs, and code. Your focus should be on both the results and the process. Include, as reasonable and relevant, approaches that didn’t work, challenges, the data cleaning process, etc.
• The guidelines for the Executive Summary above do NOT apply to exploratory data analysis. Your main concern is designing graphs that reveal patterns and trends.
• As noted in Hmk #4, do not use circles, that is: bubbles, pie charts, or polar coordinates.
• Use stacked bar charts sparingly. Try grouped bar charts and faceting as alternatives, and only choose stacked bar charts if they truly do a better job than the alternatives for observing patterns.
Data Cleaning
Because the data is very messy, we put considerable effort into cleaning it and extracting useful information for analysis.
- Convert to correct type
- Consolidate name, region, date
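The two cleaning steps listed above can be sketched on a small made-up data frame. The column names (`Country`, `Period`, `Arrivals`) and the sample values are hypothetical and not taken from the actual files; the sketch only illustrates the general approach of parsing numbers stored as text, normalizing names, and building a proper date.

```r
library(dplyr)

# Hypothetical raw extract: numbers stored as text with thousands separators,
# inconsistent capitalization/whitespace in names, dates as "YYYY-MM" strings.
raw <- data.frame(
  Country  = c(" Canada", "MEXICO ", "canada"),
  Period   = c("2015-07", "2015-07", "2015-08"),
  Arrivals = c("1,234", "2,000", "987"),
  stringsAsFactors = FALSE
)

clean <- raw %>%
  mutate(
    # Convert to the correct type: strip commas, then parse as numeric
    Arrivals = as.numeric(gsub(",", "", Arrivals)),
    # Consolidate names: trim whitespace and lower-case
    Country  = tolower(trimws(Country)),
    # Consolidate dates: use the first day of the month as a Date
    Date     = as.Date(paste0(Period, "-01")),
    Year     = as.integer(format(Date, "%Y"))
  )
```

After this step, " Canada" and "canada" collapse into a single name, so later `group_by` calls treat them as one group.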
Join inbound and outbound data for the same region
# Regions to keep. Note: in a regular expression the parentheses form a group,
# so this alternative matches "latin america excl mexico" (no literal parentheses).
region_str <- "africa|asia|canada|latin america (excl mexico)|europe|mexico|middle east|oceania"
# Monthly inbound totals per region
inbound_region <- tidy_ntto_inbound_m %>%
  filter(grepl(region_str, MixRegion)) %>%
  select(Region=MixRegion, Year, Date, Inbound) %>%
  group_by(Region, Year, Date) %>%
  summarise(TotalInbound=sum(Inbound)) %>%
  ungroup
# Monthly outbound totals per region
outbound_region <- tidy_ntto_outbound_m %>%
  select(Region, Year, Date, Outbound) %>%
  group_by(Region, Year, Date) %>%
  summarise(TotalOutbound=sum(Outbound)) %>%
  ungroup
# Keep only Region/Year/Date combinations present in both tables
regional_travel <- inner_join(inbound_region, outbound_region,
  by=c("Region", "Year", "Date"))
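Because `inner_join` silently drops any Region/Date combination that appears in only one of the two tables, it can be worth checking what is lost. The sketch below uses small made-up tables (`inbound_demo` and `outbound_demo` are hypothetical, as are their values) and `anti_join` to list the unmatched rows.

```r
library(dplyr)

# Hypothetical stand-ins for inbound_region and outbound_region
inbound_demo <- data.frame(
  Region = c("asia", "asia", "europe"),
  Date = c("2015-01", "2015-02", "2015-01"),
  TotalInbound = c(100, 110, 200),
  stringsAsFactors = FALSE
)
outbound_demo <- data.frame(
  Region = c("asia", "europe", "europe"),
  Date = c("2015-01", "2015-01", "2015-02"),
  TotalOutbound = c(50, 80, 90),
  stringsAsFactors = FALSE
)

# Rows present in both tables
joined <- inner_join(inbound_demo, outbound_demo, by = c("Region", "Date"))
# Rows that the inner join drops from each side
only_inbound <- anti_join(inbound_demo, outbound_demo, by = c("Region", "Date"))
only_outbound <- anti_join(outbound_demo, inbound_demo, by = c("Region", "Date"))
```

If either `only_inbound` or `only_outbound` is non-empty on the real data, the joined table covers fewer months or regions than the inputs.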
Challenges
We faced several challenges in this project:
1. The data set has problems such as inconsistencies, so we had to spend a lot of time cleaning and reorganizing it, which was tedious and laborious.
2. Some graphs need country-level data, but what we can get from the data set is region-level. In those cases we had to work at the level of the whole region, which makes the analysis less comprehensive and detailed.
3. Shiny is a great tool for creating interactive data visualizations in R, but we had little experience with it and had to spend time learning it, which was not easy in such a short period.
Analysis
We first visualized the inbound trend for all regions. From the plot we can tell that:
1. Most regions show a growing trend.
2. Canada has the highest inbound numbers.
3. Inbound travel is very seasonal; for example, Canada's peak always occurs in July.
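The seasonality observation can also be checked numerically rather than read off the plot. The following is a minimal sketch on made-up numbers, assuming a `Date` column of class `Date`; the data frame `demo` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical monthly counts for one region over two years
demo <- data.frame(
  Region = "canada",
  Date = as.Date(c("2014-06-01", "2014-07-01", "2014-08-01",
                   "2015-06-01", "2015-07-01", "2015-08-01")),
  TotalInbound = c(1800, 2500, 2300, 1900, 2600, 2400),
  stringsAsFactors = FALSE
)

peak_months <- demo %>%
  mutate(Year = format(Date, "%Y"), Month = format(Date, "%m")) %>%
  group_by(Region, Year) %>%
  # keep the row with the largest monthly count in each region-year
  slice(which.max(TotalInbound)) %>%
  ungroup() %>%
  select(Region, Year, Month, TotalInbound)
```

If the peak month is the same in every year for a region, the table supports the "always in July" reading; a mix of months would weaken it.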
p1 <- inbound_region %>%
spread(Region, TotalInbound) %>%
filter(Date>'2008-11') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~africa, name='africa', mode='lines', line = list(color="gray", width = 1)) %>%
add_trace(y=~asia, name='asia', mode='lines', line = list(color="red", width = 1)) %>%
add_trace(y=~canada, name='canada', mode='lines', line = list(color="orange", width = 1)) %>%
add_trace(y=~europe, name='europe', mode='lines', line = list(color="pink", width = 1)) %>%
add_trace(y=~`latin america excl mexico`, name='latin america excl mexico', mode='lines', line = list(color="green", width = 1)) %>%
add_trace(y=~mexico, name='mexico', mode='lines', line = list(color="purple", width = 1)) %>%
add_trace(y=~`middle east`, name='middle east', mode='lines', line = list(color="black", width = 1)) %>%
add_trace(y=~oceania, name='oceania', mode='lines', line = list(color="blue", width = 1))
p2 <- outbound_region %>% spread(Region, TotalOutbound) %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~africa, name='africa', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~asia, name='asia', mode='lines', line = list(color="red", width = 1), showlegend=F) %>%
add_trace(y=~canada, name='canada', mode='lines', line = list(color="orange", width = 1), showlegend=F) %>%
add_trace(y=~europe, name='europe', mode='lines', line = list(color="pink", width = 1), showlegend=F) %>%
add_trace(y=~`latin america excl mexico`, name='latin america excl mexico', mode='lines', line = list(color="green", width = 1), showlegend=F) %>%
add_trace(y=~mexico, name='mexico', mode='lines', line = list(color="purple", width = 1), showlegend=F) %>%
add_trace(y=~`middle east`, name='middle east', mode='lines', line = list(color="black", width = 1), showlegend=F) %>%
add_trace(y=~oceania, name='oceania', mode='lines', line = list(color="blue", width = 1), showlegend=F)
subplot(p1, p2, nrows=2, shareX=T) %>%
layout(title = "Inbound vs. Outbound",
yaxis = list(title = "Inbound"),
yaxis2 = list(title = "Outbound"),
legend = list(orientation = 'h')
)
Next we turn to outbound travel for all regions. From the graph we observe several things:
1. Mexico has the highest outbound numbers.
2. Mexico shows a sharp jump in 2009.
3. Mexico's outbound numbers are increasing, while the other regions appear stable.
4. Like inbound, outbound travel shows a seasonal pattern.
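A jump such as the one noted for Mexico in 2009 can be quantified as a year-over-year percentage change. The sketch below is illustrative only: the data frame `yearly` and all of its numbers are made up, and the real monthly data would first need to be summed by year.

```r
library(dplyr)

# Hypothetical yearly totals; the numbers are made up for illustration
yearly <- data.frame(
  Region = rep(c("mexico", "europe"), each = 3),
  Year = rep(2008:2010, times = 2),
  TotalOutbound = c(100, 150, 165, 200, 190, 195),
  stringsAsFactors = FALSE
)

yoy <- yearly %>%
  arrange(Region, Year) %>%
  group_by(Region) %>%
  # percentage change relative to the previous year within each region
  mutate(PctChange = 100 * (TotalOutbound / lag(TotalOutbound) - 1)) %>%
  ungroup()
```

The first year of each region has no previous value, so its `PctChange` is `NA`.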
Because Canada and Mexico dominate the totals, and because we are interested in how behaviour differs by region, we next look at inbound and outbound travel for each region separately.
p1 <- regional_travel %>%
filter(Region=='africa') %>%
plot_ly(x = ~as.POSIXct(Date), height = 1000) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1)) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1)) %>%
layout(autosize=F)
p2 <- regional_travel %>%
filter(Region=='asia') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p3 <- regional_travel %>%
filter(Region=='canada') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p4 <- regional_travel %>%
filter(Region=='europe') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p5 <- regional_travel %>%
filter(Region=='latin america excl mexico') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p6 <- regional_travel %>%
filter(Region=='mexico') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p7 <- regional_travel %>%
filter(Region=='middle east') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
p8 <- regional_travel %>%
filter(Region=='oceania') %>%
plot_ly(x = ~as.POSIXct(Date)) %>%
add_trace(y=~TotalInbound, type="scatter", name='Inbound', mode='lines', line = list(color="gray", width = 1), showlegend=F) %>%
add_trace(y=~TotalOutbound, type="scatter", name='Outbound', mode='lines', line = list(color="blue", width = 1), showlegend=F)
subplot(p1, p2, p3, p4, p5, p6, p7, p8, nrows=8) %>%
layout(title = "Regional Inbound and Outbound",
yaxis = list(title = "Africa"),
yaxis2 = list(title = "Asia"),
yaxis3 = list(title = "Canada"),
yaxis4 = list(title = "Europe"),
yaxis5 = list(title = "Latin America"),
yaxis6 = list(title = "Mexico"),
yaxis7 = list(title = "Middle East"),
yaxis8 = list(title = "Oceania"),
legend = list(orientation = 'h', x = 0, y = 1.005)
)
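The eight per-region blocks above differ only in the region name, so they could also be generated by a small helper. This is a sketch rather than the code used for the report: `region_plot` is a hypothetical helper name, and `demo_travel` is a made-up stand-in for `regional_travel` with a `Date` column assumed to be of class `Date`.

```r
library(dplyr)
library(plotly)

# Hypothetical stand-in for regional_travel
demo_travel <- data.frame(
  Region = rep(c("africa", "asia"), each = 3),
  Date = rep(as.Date(c("2015-01-01", "2015-02-01", "2015-03-01")), times = 2),
  TotalInbound = c(10, 12, 11, 100, 120, 115),
  TotalOutbound = c(8, 9, 10, 90, 95, 99),
  stringsAsFactors = FALSE
)

# Build one inbound/outbound line plot for a single region
region_plot <- function(data, region, show_legend = FALSE) {
  data %>%
    filter(Region == region) %>%
    plot_ly(x = ~Date) %>%
    add_trace(y = ~TotalInbound, type = "scatter", mode = "lines", name = "Inbound",
              line = list(color = "gray", width = 1), showlegend = show_legend) %>%
    add_trace(y = ~TotalOutbound, type = "scatter", mode = "lines", name = "Outbound",
              line = list(color = "blue", width = 1), showlegend = show_legend)
}

regions <- sort(unique(demo_travel$Region))
# Show the legend only once, on the first panel
plots <- lapply(seq_along(regions), function(i) {
  region_plot(demo_travel, regions[i], show_legend = (i == 1))
})
combined <- subplot(plots, nrows = length(plots), shareX = TRUE)
```

Adding a region then only requires it to be present in the data, and the colors and line widths stay consistent across panels.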
yearly_spend <- tidy_ntto_spend_y %>%
filter(Region!='european union', Region!='south-central america', Region!='overseas') %>%
mutate(Region=recode(Region, "asia-pacific"="asia"), Spend=Spend*1000000) %>%
select(-Missing) %>%
arrange(Region, Year, Type, Category)
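The recoding and rescaling in `yearly_spend` can be illustrated on a tiny example. The sketch assumes the source figures are in millions of dollars, which the `Spend*1000000` step suggests but which is not confirmed here; `spend_demo` and its values are made up. It also shows a conversion to billions, which is the unit one of the axis labels later in this section uses.

```r
library(dplyr)

spend_demo <- data.frame(
  Region = c("asia-pacific", "europe"),
  Spend = c(1500, 250),  # hypothetical values, assumed to be in millions of dollars
  stringsAsFactors = FALSE
)

spend_scaled <- spend_demo %>%
  mutate(
    # Align the region name with the one used in the travel data
    Region = recode(Region, "asia-pacific" = "asia"),
    # Millions of dollars -> dollars (same factor as in yearly_spend)
    SpendDollars = Spend * 1e6,
    # Dollars -> billions of dollars, convenient for axis labels
    SpendBillions = SpendDollars / 1e9
  )
```

Plotting `SpendBillions` keeps the axis values consistent with a "Billion $" label, whereas plotting raw dollars under that label would overstate the scale.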
yearly_spend %>%
group_by(Type, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Type, TotalSpend) %>%
ungroup %>%
plot_ly(x = ~Year) %>%
add_trace(y=~`Payments (imports)`, type="scatter", name='Payments (imports)', mode = 'lines+markers', line = list(color="blue", width = 1)) %>%
add_trace(y=~`Receipts (exports)`, type="scatter", name='Receipts (exports)', mode = 'lines+markers', line = list(width = 1)) %>%
layout(title = "Yearly Spending",
xaxis = list(title = "Year"), yaxis = list(title = "Spend"),
legend = list(orientation = 'h', x = 0.5, y = 1.005))
yearly_spend %>%
group_by(Type, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Type, TotalSpend) %>%
plot_ly(x = ~Year, y = ~`Payments (imports)`, type = 'bar', name = 'Payments', marker = list(color = 'rgb(55, 83, 109)')) %>%
add_trace(y = ~`Receipts (exports)`, name = 'Receipts', marker = list(color = 'rgb(26, 118, 255)')) %>%
layout(title = 'Yearly Spending',
xaxis = list(
title = "",
tickfont = list(
size = 14,
color = 'rgb(107, 107, 107)')),
yaxis = list(
title = 'Spend (Billion $)',
titlefont = list(
size = 16,
color = 'rgb(107, 107, 107)'),
tickfont = list(
size = 14,
color = 'rgb(107, 107, 107)')),
# legend = list(orientation = 'h', x = 0.5, y = 1.005)
legend = list(orientation = 'h', x = 0, y = 1, bgcolor = 'rgba(255, 255, 255, 0)', bordercolor = 'rgba(255, 255, 255, 0)'),
barmode = 'group', bargap = 0.15, bargroupgap = 0.1)
'layout' objects don't have these attributes: 'bargroupgap'
Valid attributes include:
'font', 'title', 'titlefont', 'autosize', 'width', 'height', 'margin', 'paper_bgcolor', 'plot_bgcolor', 'separators', 'hidesources', 'smith', 'showlegend', 'dragmode', 'hovermode', 'xaxis', 'yaxis', 'scene', 'geo', 'legend', 'annotations', 'shapes', 'images', 'updatemenus', 'ternary', 'mapbox', 'radialaxis', 'angularaxis', 'direction', 'orientation', 'barmode', 'bargap', 'mapType'
yearly_spend %>%
filter(Type=="Payments (imports)") %>%
select(-Type) %>%
group_by(Region, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Region, TotalSpend) %>%
plot_ly(x = ~Year) %>%
add_trace(y=~africa, type="scatter", name='africa', mode = 'lines+markers', line = list(width = 2)) %>%
add_trace(y=~asia, type="scatter", name='asia', mode = 'lines+markers', line = list(width = 2)) %>%
add_trace(y=~europe, type="scatter", name='europe', mode = 'lines+markers', line = list(width = 2)) %>%
add_trace(y=~`latin america`, type="scatter", name='latin america', mode = 'lines+markers', line = list(width = 2)) %>%
add_trace(y=~`middle east`, type="scatter", name='middle east', mode = 'lines+markers', line = list(width = 2)) %>%
layout(title = "Payments (imports)", xaxis = list(title = "Year"), yaxis = list(title = "Spend"),
legend = list(orientation = 'h'))
yearly_spend %>%
filter(Type=="Receipts (exports)") %>%
select(-Type) %>%
group_by(Region, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Region, TotalSpend) %>%
plot_ly(x = ~Year) %>%
add_trace(y=~africa, type="scatter", name='africa', mode = 'lines+markers', line = list(width = 1)) %>%
add_trace(y=~asia, type="scatter", name='asia', mode = 'lines+markers', line = list(width = 1)) %>%
add_trace(y=~europe, type="scatter", name='europe', mode = 'lines+markers', line = list(width = 1)) %>%
add_trace(y=~`latin america`, type="scatter", name='latin america', mode = 'lines+markers', line = list(width = 1)) %>%
add_trace(y=~`middle east`, type="scatter", name='middle east', mode = 'lines+markers', line = list(width = 1)) %>%
layout(title = "Receipts (exports)",
xaxis = list(title = "Year"),
yaxis = list(title = "Spend"),
legend = list(orientation = 'h'))
yearly_spend %>%
filter(Region=="africa") %>%  # restrict to Africa so the data matches the plot title
group_by(Type, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Type, TotalSpend) %>%
plot_ly(x = ~Year) %>%
add_trace(y=~`Payments (imports)`, type="scatter", name='Payments (imports)', mode = 'lines+markers', line = list(color="blue", width = 1)) %>%
add_trace(y=~`Receipts (exports)`, type="scatter", name='Receipts (exports)', mode = 'lines+markers', line = list(width = 1)) %>%
layout(title = "Africa",
xaxis = list(title = "Year"), yaxis = list(title = "Spend"),
legend = list(orientation = 'h', x = 0.5, y = 1.005))
tidy_wb_gdp %>%
filter(CountryCode=="USA", Year>2000, Year<2016) %>%
select(Year, GDP) %>%
mutate(Year=factor(Year)) %>%
plot_ly(x = ~Year) %>%
add_trace(y=~GDP, type="scatter", name='US', mode = 'lines+markers', line = list(width = 1))
yearly_spend %>%
filter(Type=="Receipts (exports)") %>%
select(-Type, -Region) %>%
group_by(Category, Year) %>%
summarise(TotalSpend=sum(Spend)) %>%
spread(Category, TotalSpend) %>%
plot_ly(x = ~Year, y = ~Education, type = 'bar', name = 'Education') %>%
add_trace(y = ~`Medical/Short-Term Workers`, name = 'Medical/Short-Term Workers') %>%
add_trace(y = ~`Other Business/Other Personal Travel`, name = 'Other Business/Other Personal Travel') %>%
layout(yaxis = list(title = 'Spend'), barmode = 'group')
Finally, we select several regions and combine inbound and outbound travel with GDP.
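One way such a combination could look is sketched below: monthly travel counts are summed to yearly totals and then joined to yearly GDP on `Year`. This is an assumption about the approach, not the report's actual code; `travel_demo`, `gdp_demo`, and all values are hypothetical.

```r
library(dplyr)

# Hypothetical travel data (two observations per year) and yearly GDP
travel_demo <- data.frame(
  Region = "canada",
  Year = c(2014, 2014, 2015, 2015),
  TotalInbound = c(10, 20, 15, 25),
  TotalOutbound = c(5, 5, 6, 8),
  stringsAsFactors = FALSE
)
gdp_demo <- data.frame(Year = c(2014, 2015), GDP = c(17.5e12, 18.2e12))

travel_gdp <- travel_demo %>%
  group_by(Region, Year) %>%
  # aggregate to the yearly level so the granularity matches GDP
  summarise(Inbound = sum(TotalInbound), Outbound = sum(TotalOutbound)) %>%
  ungroup() %>%
  inner_join(gdp_demo, by = "Year")
```

Aggregating before the join matters: joining monthly rows directly would repeat each yearly GDP value twelve times.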